Multiple feed-forward deep neural networks for statistical parametric speech synthesis
Abstract
In this paper, we investigate a combination of several feed-forward deep neural networks (DNNs) for a high-quality statistical parametric speech synthesis system. Recently, DNNs have significantly improved the performance of essential components of statistical parametric speech synthesis, e.g. spectral feature extraction, acoustic modeling and spectral post-filtering. Our proposed technique combines these feed-forward DNNs so that they can perform all standard steps of statistical speech synthesis from end to end, including feature extraction from STRAIGHT spectral amplitudes, acoustic modeling, smooth trajectory generation and spectral post-filtering. The proposed DNN-based speech synthesis system is then compared to state-of-the-art speech synthesis systems, i.e. conventional HMM-based, DNN-based and unit selection ones.
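The chaining of feed-forward networks described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration only: the layer sizes, the names `acoustic_model`, `decoder`, `post_filter` and `synthesize_frame`, and the random untrained weights are hypothetical and are not taken from the paper. The trajectory generation step is omitted, and the sketch shows only how the output of one network becomes the input of the next.

```python
import math
import random


def make_network(sizes, rng):
    """Build a list of (weights, biases) layers with small random weights (untrained)."""
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((weights, [0.0] * n_out))
    return layers


def feed_forward(layers, x):
    """Apply tanh hidden layers followed by a linear output layer."""
    for i, (weights, biases) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
        if i < len(layers) - 1:  # no non-linearity on the output layer
            x = [math.tanh(v) for v in x]
    return x


rng = random.Random(0)
LINGUISTIC_DIM, FEATURE_DIM, SPECTRUM_DIM = 8, 4, 16  # hypothetical sizes

# Three separate feed-forward networks, one per stage (all hypothetical).
acoustic_model = make_network([LINGUISTIC_DIM, 12, FEATURE_DIM], rng)  # linguistic -> low-dim spectral features
decoder = make_network([FEATURE_DIM, 12, SPECTRUM_DIM], rng)           # features -> spectral amplitudes
post_filter = make_network([SPECTRUM_DIM, 12, SPECTRUM_DIM], rng)      # spectrum -> enhanced spectrum


def synthesize_frame(linguistic_features):
    """Run one frame through the chained networks."""
    features = feed_forward(acoustic_model, linguistic_features)
    spectrum = feed_forward(decoder, features)
    return feed_forward(post_filter, spectrum)


frame = synthesize_frame([0.5] * LINGUISTIC_DIM)
```

Because each stage is an ordinary feed-forward network, the stages compose by plain function application. The sketch is meant to show that composition only; it makes no claim about how the networks in the paper are trained or sized.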
Related resources
TTS synthesis with bidirectional LSTM based recurrent neural networks
Feed-forward deep neural network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree clustered context-dependent HMM TTS systems [1, 4]. However, the long time span contextual effect in a speech utterance is still not easy to accommodate, due to the intrinsic feed-forward nature of DNN-based modeling. Also, to synthesize a smooth speech trajectory, th...
Investigating very deep highway networks for parametric speech synthesis
The depth of a neural network is a vital factor that affects its performance. Recently, a new architecture called the highway network was proposed. This network facilitates the training of a very deep neural network by using gate units to control an information highway over the conventional hidden layer. For the speech synthesis task, we investigate the performance of highway networks with ...
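The gating idea mentioned in this abstract can be sketched with the commonly cited highway layer formulation, y = T(x) * H(x) + (1 - T(x)) * x, where H is a conventional hidden transform and T is a sigmoid transform gate. The sketch below assumes that formulation. The function names, the tanh non-linearity and the toy weights are hypothetical, and the truncated abstract does not say which configuration the paper itself uses.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, biases, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def highway_layer(x, w_h, b_h, w_t, b_t):
    """Highway layer: y = T(x) * H(x) + (1 - T(x)) * x (input and output sizes match)."""
    h = [math.tanh(v) for v in affine(w_h, b_h, x)]  # conventional hidden transform H(x)
    t = [sigmoid(v) for v in affine(w_t, b_t, x)]    # transform gate T(x), each value in (0, 1)
    return [ti * hi + (1.0 - ti) * xi for ti, hi, xi in zip(t, h, x)]


# Toy example: identity hidden weights and a gate driven only by its bias.
x = [0.3, -0.7]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

carried = highway_layer(x, identity, [0.0, 0.0], zeros, [-20.0, -20.0])    # gate almost closed
transformed = highway_layer(x, identity, [0.0, 0.0], zeros, [20.0, 20.0])  # gate almost open
```

With a strongly negative gate bias the layer passes its input through almost unchanged, and with a strongly positive bias it behaves like an ordinary tanh layer. That pass-through path is what makes very deep stacks of such layers easier to train.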
Acoustic Modeling in Statistical Parametric Speech Synthesis – from HMM to LSTM-RNN
Statistical parametric speech synthesis (SPSS) combines an acoustic model and a vocoder to render speech given a text. Typically decision tree-clustered context-dependent hidden Markov models (HMMs) are employed as the acoustic model, which represent a relationship between linguistic and acoustic features. Recently, artificial neural network-based acoustic models, such as deep neural networks, ...
Recurrent Neural Network Postfilters for Statistical Parametric Speech Synthesis
In the last two years, there have been numerous papers that have looked into using Deep Neural Networks to replace the acoustic model in traditional statistical parametric speech synthesis. However, far less attention has been paid to approaches like DNN-based postfiltering where DNNs work in conjunction with traditional acoustic models. In this paper, we investigate the use of Recurrent Neural...
Isolated Word Speech Recognition System Using Deep Neural Networks
Speech recognition is the process of converting speech signals into words. For acoustic modeling, HMM-GMM has been used for many years. The GMM requires assumptions about the data distribution in order to calculate probabilities. To remove this limitation, the GMM is replaced by a DNN in the acoustic model. Deep neural networks are feed-forward neural networks having more than one or multiple layers of hidd...